Skip to content

refactor: remove sync engine#767

Open
andreatgretel wants to merge 2 commits into
andreatgretel/fix/remove-allow-resizefrom
andreatgretel/refactor/remove-sync-engine
Open

refactor: remove sync engine#767
andreatgretel wants to merge 2 commits into
andreatgretel/fix/remove-allow-resizefrom
andreatgretel/refactor/remove-sync-engine

Conversation

@andreatgretel

Copy link
Copy Markdown
Contributor

📋 Summary

Removes the legacy sync dataset-builder engine and the DATA_DESIGNER_ASYNC_ENGINE opt-out so generation always uses the async scheduler. This is stacked on #766.

🔗 Related Issue

Stacked on #766.

🔄 Changes

  • Removed the sync DatasetBuilder batch loop, fan-out helpers, resume path, feature flag module, and obsolete DatasetBatchManager.
  • Defaulted DataDesigner and ResourceProvider client concurrency to async for generation/check_models.
  • Updated tests and docs to describe async-only row-group execution.

🧪 Testing

  • make test passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Ran:

  • .venv/bin/ruff check --fix .
  • .venv/bin/ruff format .
  • .venv/bin/pytest packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py packages/data-designer-engine/tests/engine/test_readiness.py packages/data-designer/tests/interface/test_data_designer.py
  • .venv/bin/pytest packages/data-designer-engine/tests/engine/column_generators/generators/test_custom.py packages/data-designer-engine/tests/engine/resources/test_resource_provider.py packages/data-designer-engine/tests/engine/models/clients/test_factory.py
  • git diff --check

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@andreatgretel andreatgretel requested a review from a team as a code owner June 23, 2026 20:34
@github-actions

Copy link
Copy Markdown
Contributor

Fern preview: https://nvidia-preview-pr-767.docs.buildwithfern.com/nemo/datadesigner

Fern previews include the docs-website version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews.

@greptile-apps

greptile-apps Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR removes the legacy sync dataset-builder engine and the DATA_DESIGNER_ASYNC_ENGINE opt-out flag, making the async scheduler the sole execution path. It is stacked on #766 and is a cleanup that was pre-announced as pending removal.

  • Deletes flags.py, DatasetBatchManager, _run_batch, _build_with_resume, _fan_out_with_threads, _fan_out_with_async, and all conditional sync/async branches from DatasetBuilder; RowGroupBufferManager is now the only buffer abstraction.
  • Updates DataDesigner, ResourceProvider, and run_readiness_check to always pass ClientConcurrencyMode.ASYNC (with the intentional exception of get_models, which explicitly uses SYNC since it does not run the scheduler).
  • Refreshes documentation, test fixtures, and the benchmark script to reflect the async-only world; test assertions for resume, error surfacing, and progress tracking are updated to match async-engine semantics.

Confidence Score: 5/5

Safe to merge. The sync engine removal is complete and consistent across all layers — engine, interface, config, tests, docs, and benchmark.

Every reference to the old sync execution path has been removed consistently. The async scheduler was already the default and well-tested; this PR simply cuts the fallback with no new logic introduced.

No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py Core orchestrator: removes ~330 lines of sync-only code. build() and build_preview() now unconditionally dispatch to _build_async / _build_async_preview; _recover_progress_from_disk always passes allow_holes=True.
packages/data-designer-engine/src/data_designer/engine/flags.py File deleted. DATA_DESIGNER_ASYNC_ENGINE env var removed; every consumer updated to use hardcoded ASYNC mode or removed.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dataset_batch_manager.py File deleted. DatasetBatchManager removed; RowGroupBufferManager is the sole buffer abstraction going forward.
packages/data-designer/src/data_designer/interface/data_designer.py _resolve_client_concurrency_mode removed; _create_resource_provider always passes ClientConcurrencyMode.ASYNC by default, with get_models explicitly passing SYNC.
packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py get_generation_strategy now coerces via GenerationStrategy(self.config.generation_strategy) ensuring the return type is always an enum member.
packages/data-designer-engine/src/data_designer/engine/resources/resource_provider.py Default client_concurrency_mode simplified to ClientConcurrencyMode.ASYNC unconditionally.
scripts/benchmarks/benchmark_engine_v2.py Benchmark simplified to async-only; ArtifactStorage import path corrected from engine.dataset_builders to engine.storage.
packages/data-designer-engine/tests/engine/conftest.py Added mock_model_registry.request_admission = None so async scheduler does not get a truthy Mock for admission control in tests.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant User
    participant DataDesigner
    participant DatasetBuilder
    participant AsyncTaskScheduler
    participant RowGroupBufferManager
    participant ArtifactStorage

    User->>DataDesigner: create(num_records)
    DataDesigner->>DatasetBuilder: build(num_records, buffer_size)
    DatasetBuilder->>DatasetBuilder: run_readiness_check(ASYNC mode)
    DatasetBuilder->>DatasetBuilder: _load_resume_state (if ResumeMode.ALWAYS)
    DatasetBuilder->>DatasetBuilder: _build_async(generators, num_records, buffer_size)
    DatasetBuilder->>AsyncTaskScheduler: run(ExecutionGraph, row_groups)
    loop Per row group (concurrent)
        AsyncTaskScheduler->>RowGroupBufferManager: allocate row group
        AsyncTaskScheduler->>AsyncTaskScheduler: FairTaskQueue.select_next()
        AsyncTaskScheduler->>AsyncTaskScheduler: TaskAdmissionController.acquire_lease()
        AsyncTaskScheduler->>AsyncTaskScheduler: _run_cell / _run_batch (per column task)
        AsyncTaskScheduler->>RowGroupBufferManager: checkpoint_row_group()
        RowGroupBufferManager->>ArtifactStorage: write parquet + metadata
    end
    AsyncTaskScheduler-->>DatasetBuilder: telemetry + task traces
    DatasetBuilder->>DatasetBuilder: ProcessorRunner.run_after_generation()
    DatasetBuilder-->>DataDesigner: final dataset path
    DataDesigner-->>User: DatasetCreationResults
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant User
    participant DataDesigner
    participant DatasetBuilder
    participant AsyncTaskScheduler
    participant RowGroupBufferManager
    participant ArtifactStorage

    User->>DataDesigner: create(num_records)
    DataDesigner->>DatasetBuilder: build(num_records, buffer_size)
    DatasetBuilder->>DatasetBuilder: run_readiness_check(ASYNC mode)
    DatasetBuilder->>DatasetBuilder: _load_resume_state (if ResumeMode.ALWAYS)
    DatasetBuilder->>DatasetBuilder: _build_async(generators, num_records, buffer_size)
    DatasetBuilder->>AsyncTaskScheduler: run(ExecutionGraph, row_groups)
    loop Per row group (concurrent)
        AsyncTaskScheduler->>RowGroupBufferManager: allocate row group
        AsyncTaskScheduler->>AsyncTaskScheduler: FairTaskQueue.select_next()
        AsyncTaskScheduler->>AsyncTaskScheduler: TaskAdmissionController.acquire_lease()
        AsyncTaskScheduler->>AsyncTaskScheduler: _run_cell / _run_batch (per column task)
        AsyncTaskScheduler->>RowGroupBufferManager: checkpoint_row_group()
        RowGroupBufferManager->>ArtifactStorage: write parquet + metadata
    end
    AsyncTaskScheduler-->>DatasetBuilder: telemetry + task traces
    DatasetBuilder->>DatasetBuilder: ProcessorRunner.run_after_generation()
    DatasetBuilder-->>DataDesigner: final dataset path
    DataDesigner-->>User: DatasetCreationResults
Loading

Reviews (2): Last reviewed commit: "fix: address sync removal review" | Re-trigger Greptile

@johnnygreco johnnygreco left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this one, @andreatgretel — removing the old engine is a big simplification, and the core builder path looks much easier to reason about now.

Summary

This PR removes the legacy sync dataset-builder path and makes generation/check-model readiness use the async scheduler and async model clients by default. The implementation matches the stated intent for generation, but I found one public helper regression around direct model use and one stale public doc reference to the removed opt-out.

Findings

Warnings — Worth addressing

Design issues, missing error handling, test gaps, or violations of project standards that could cause problems later.

packages/data-designer/src/data_designer/interface/data_designer.py:690get_models() now returns facades whose sync methods are unusable

  • What: _create_resource_provider() now always passes ClientConcurrencyMode.ASYNC, and get_models() calls that helper unchanged at lines 617-619. Those facades are plain ModelFacades, not the _AsyncBridgedModelFacade used inside async custom columns, so a direct model.generate() call goes through ModelRequestExecutor.completion() to an async-mode HTTP client and hits SyncClientUnavailableError from _get_sync_client().
  • Why: get_models() is a public helper specifically documented for testing custom-column functions outside the full pipeline, and the docs still show result = my_generator({"name": "Alice"}, None, models) where the generator examples call models["..."].generate(...). This breaks that workflow even though the sync dataset engine removal itself is intentional.
  • Suggestion: Could we keep get_models() on sync-mode clients, e.g. let _create_resource_provider() accept a client_concurrency_mode override and pass ClientConcurrencyMode.SYNC from get_models()? If we want direct testing to be async-only instead, we should update the helper/docs together and add coverage for the new expected usage.

README.md:29 — README still advertises the removed sync-engine fallback

  • What: The root README still tells users they can set DATA_DESIGNER_ASYNC_ENGINE=0 to fall back to the legacy sync engine, but this PR deletes engine/flags.py and removes the code paths that read that environment variable.
  • Why: After this lands in the major release, users following the README will think a rollback path exists when it no longer does; worse, setting the env var is now silently ignored by generation.
  • Suggestion: Update this section to say the async engine is the only execution path, and remove the DATA_DESIGNER_ASYNC_ENGINE=0 fallback language.

Suggestions — Take it or leave it

Style improvements, minor simplifications, or optional enhancements that would improve code quality.

scripts/benchmarks/benchmark_engine_v2.py:667 — Benchmark compare mode still toggles the removed env var

  • What: The benchmark script still uses DATA_DESIGNER_ASYNC_ENGINE to label subprocesses as sync vs async, and --engine sync now just runs the async engine with a different label.
  • Why: This can produce misleading speedup numbers by comparing async against async, especially because the script name and output still frame it as a dual-engine comparison.
  • Suggestion: Consider removing sync compare mode, making --engine sync fail with a clear message, or repurposing the script as an async-only benchmark.

What Looks Good

  • The core DatasetBuilder path is much cleaner with the sync branch removed, and the async scheduler wiring is now the obvious path through build() and build_preview().
  • Resume behavior kept the important crash-window safeguards: filesystem-derived row-group progress, immutable original targets, and terminal handling after process_after_generation().
  • The docs under architecture/ and Fern mostly track the new async-only execution model, and the focused tests around builder/resume/readiness give good confidence in the main path.

Verdict

Needs changes — I’d address the get_models() regression and README fallback reference before merge. The benchmark cleanup is optional but worth doing while this context is fresh.


This review was generated by an AI assistant.

@andreatgretel andreatgretel force-pushed the andreatgretel/refactor/remove-sync-engine branch from 231a656 to 2c2eb90 Compare June 23, 2026 23:24
@andreatgretel

Copy link
Copy Markdown
Contributor Author

@johnnygreco thanks for the review. I pushed 2c2eb906 to address the requested changes:

  • Kept get_models() on sync-mode clients so direct model.generate() usage still works for custom-column testing.
  • Removed the README fallback text for DATA_DESIGNER_ASYNC_ENGINE=0.
  • Simplified the benchmark script to async-only behavior so it no longer reports misleading sync-vs-async comparisons.

Greptile is green on the latest commit and the active checks are passing.

@johnnygreco johnnygreco left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @andreatgretel. I rechecked the latest commit (2c2eb90) and the requested changes are addressed: get_models() now uses sync clients for direct custom-column testing, the README no longer advertises the removed async-engine opt-out, and the benchmark script is async-only. I also reran the focused checks locally (ruff check/format and 162 relevant tests) and they passed.\n\nOne separate note: GitHub currently reports the branch as conflicting with the base, so that will still need resolving before merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants